The point of TDT is that you act as if you were deciding, not just on your own behalf, but on behalf of all agents sufficiently identical to you.
It has always seemed to me that the same decisions should be obtainable from ordinary decision theory, if you genuinely take into account the uncertainty about who and what you are. There are many possible worlds containing an agent whose experience is subjectively indistinguishable from yours; an idealized rationality, applied to an agent in your subjective situation, would actually assign some probability to each of those possibilities; and hence, the agents in all those worlds “should” make the same decision (but won’t, because they aren’t all ideally rational). There remains the question of whether the higher payoff that TDT obtains in certain extreme situations can also be derived from this more conventional style of reasoning, or whether it requires some additional heuristic. In this regard, one should remember that, if we are to judge the rationality of a decision theory by payoffs obtained (“rationalists should win”), whether a heuristic is best or second-best may depend on the context (e.g. on the prior).
So let’s consider the present context. It seems that the two agents that are supposed to coordinate, using TDT, in order to avoid a supposedly predictable punishment by a FAI in the future, are yourself now and yourself in the future. We could start by asking whether these two agents are really similar enough for TDT to even apply. To repeat my earlier observations: just because a situation exists in which a particular heuristic for action produces an effective coordination of actions across distances of space and time, and therefore a higher payoff, does not mean that the heuristic in question is generally rational, or that it is a form of timeless decision theory. To judge whether the heuristic is rational, as opposed to just being lucky, we would need to establish that it has some general applicability, and that its effectiveness can be deduced by the situated agent. To judge whether employing a particular counterintuitive heuristic amounts to employing TDT, we need to establish that its justification results from applying the principles of TDT, such as “identity, or sufficient similarity, of agents”.
In this case, I would first question whether you-now and you-in-the-future are even similar enough for the principles of TDT to apply. The epistemic situation of the two is completely different: you-in-the-future knows the Singularity has occurred and a FAI has come into being, you-now does not know that either of those things will happen.
I would also question the generality of the heuristic proposed here. Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
Perhaps the bottom line is, how likely is it that a FAI would engage in this kind of “timeless precommitment to punish”? Because people now do not know what sort of super-AI, if any, the future will actually bring, any such “postcommitments” made by such an AI, after it has come into existence, cannot rationally be expected to achieve any good, in the form of retroactive influence on the past, not least because of the uncertainty about the future AI’s value system! This mode of argument—“you should have done more, because you should have been scared of what I might do to you one day”—could be employed in the service of any value system. Why don’t you allow yourself to be acausally blackmailed by a future paperclip maximizer?
Okay, I get the feeling that I might be completely wrong about this whole thing. But prior to saying “oops”, I’d like my position completely crushed, so I don’t have any kind of loophole or a partial retreat that is still wrong. This means I’ll continue to defend this position.
First of all, I got TDT wrong when I read about it here on LW. Oops. It seems it is not applicable to the problem. Still, I feel my line of argument holds: if you know that a future FAI will take all actions that lead to its faster creation, you can derive that it will also punish those who knew it would but didn’t make FAI happen faster.
Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
I’d call it friendly if it maximizes the expected utility of all humans, and if that involves blackmailing current humans who thought about this, so be it. Suppose the prior probability of a person doing X, where X makes FAI happen a minute faster and generates Y additional utility, is 0.25. If this person, after pondering the possible choices of an FAI (including punishing humans who didn’t speed up FAI development), then becomes more likely to do X (say, 0.5), the FAI may punish that human (and the human will anticipate this punishment) by up to 0.25 * Y utility for not doing X, and the FAI is still friendly. If the AI, however, decides not to punish that human, then either the human’s model of the AI was incorrect, or the human correctly anticipated this behaviour, which would mean the AI is not 100% friendly, since it could have created utility by punishing that human.
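To make the arithmetic above concrete, here is a minimal sketch of the expected-utility comparison. The 0.25, 0.5 and 0.25 * Y figures are the ones assumed in the paragraph; the scale Y = 100 and the assumption that the punishment is actually carried out whenever the person fails to do X are mine, added only for illustration.

```python
# Toy expected-utility comparison for the argument above. All numbers are the
# ones assumed in the comment (or, for the scale Y, an arbitrary choice);
# nothing here is a claim about an actual FAI.

Y = 100.0              # utility of FAI arriving one minute sooner (arbitrary scale)
p_prior = 0.25         # chance the person does X if no punishment is anticipated
p_threat = 0.50        # chance the person does X if punishment is anticipated
punishment = 0.25 * Y  # disutility inflicted if the person still fails to do X

# Policy 1: the AI does not punish; only the baseline chance of X matters.
eu_no_punish = p_prior * Y

# Policy 2: the AI foreseeably punishes; X becomes more likely, but the
# punishment must actually be carried out whenever the person fails to do X.
eu_punish = p_threat * Y - (1.0 - p_threat) * punishment

print(f"expected utility without punishment: {eu_no_punish:.1f}")  # 25.0
print(f"expected utility with punishment:    {eu_punish:.1f}")     # 37.5
```

Under these assumed numbers the punishing policy comes out ahead (37.5 vs. 25.0), which is the sense in which I claim a non-punishing AI “could have created utility by punishing that human”.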
The argument that there are many different types of AGI, including ones that reward the very actions other AGIs punish, neglects that the probabilities of the different types are spread unequally. I, personally, would assign a relatively high probability to FAI (higher than a flat prior over mind designs would suggest), so the expected utilities don’t cancel out. While we can’t have absolute certainty about the actions of a future AGI, we can assign different probabilities to different mind designs. Bipping AIs might be more likely than Freepy AIs because so many people have donated to the fictional Institute on Bipping AI, whereas there is no such thing as a Freepy AI research center. I am uncertain about the value system of a future AGI, but not completely. A future paperclip maximizer is a mind design to which I would assign a low probability, and although the many different AGIs out there might together be more probable than FAI, every single one of them is unlikely compared to FAI, and thus I should work towards FAI.
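As a toy illustration of the “unequally spread probabilities” point, here is a sketch of the expected value of working towards FAI over a handful of hypothetical mind designs. Every probability and payoff below is a made-up number chosen only to show how one relatively probable scenario can dominate many individually unlikely ones; none of them is a claim about the real distribution over future AIs.

```python
# Toy illustration of why individually unlikely AIs need not cancel out the
# case for working towards FAI. All probabilities and payoffs are invented
# purely for illustration.

# (probability of this future, utility to me of having worked towards FAI,
#  conditional on that future)
scenarios = {
    "FAI":                 (0.100, +10.0),  # single most probable design
    "paperclip maximizer": (0.010,  -1.0),  # one of many individually unlikely designs
    "Bipping AI":          (0.020,   0.0),
    "Freepy AI":           (0.005,   0.0),
    "no AGI at all":       (0.865,   0.0),
}

expected_value = sum(p * u for p, u in scenarios.values())
print(f"expected value of working towards FAI: {expected_value:+.2f}")  # +0.99
```

With these invented numbers the single relatively probable scenario dominates, so the contributions of the many unlikely AIs do not cancel it out.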
Where am I wrong? Where is this kind of argument flawed?
If you know that a future FAI will take all actions that lead to its faster creation, you can derive that it will also punish those who knew it would but didn’t make FAI happen faster.
But punishing them occurs after it has been created, and no action that it performs after it was created can cause it to have been created earlier than it was actually created. Therefore such post-singularity punishment is futile and a FAI would not perform it.
The only consideration in this scenario which can actually affect the time of an FAI’s creation is the pre-singularity fear of people who anticipated post-singularity punishment. But any actual future FAI is not itself responsible for this fear, and therefore not responsible for the consequences of that fear. Those consequences are entirely a product of ideas internal to the minds of pre-singularity people, such as ideas about the dispositions of post-singularity AIs.
Aside from the fact that I already changed my mind and came to the conclusion that an FAI won’t punish, I’d still object: if we can anticipate an FAI which does not punish, we wouldn’t feel obliged (or be tempted to feel obliged) to speed up its development. That means an AI would be better off foreseeably punishing people; and if the AI is friendly, then it has a mind design which maximizes the utility functions of humans. If that involves having a mind design such that people anticipate punishment and thereby speed up its development, so be it. Especially since we know it’s a friendly AI, it is very easy for us to anticipate its actions, which the AI knows as well. This line of argument still holds; the chain breaks at a weaker link.